Table of Contents

Dataset

My Iranian churn dataset was missing so I decided to choose the Seoul bike sharing demand dataset dataset found on UC Irvine Machine Learning Repository.

Abstract

The dataset contains count of public bikes rented at each hour in Seoul Bike haring System with the corresponding Weather data and Holidays information.

Data Set Information

Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes. The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

Objectives

The objective of this notebook is to analyze the dataset I get assigned and to make from there:

  1. A powerpoint explaining the ins and outs of the problem, my reflections on the question raised, the different variables I have created, how the problem fits into the context of the study. In this case, I will try to predict the number of bikes rented in Seoul based on temporal and meteorological features.
  2. A python code :
    • Data-visualization - showing the link between the variables and the target
    • Modeling - take scikit-learn and try several algorithms, change hyper parameters, make a search grid, compare results of my models in graphs.
  3. Transformation of the model into a Django API

Imports

Pythons libs

We can import here all the libraries we need for the notebook.

In [1]:
import json
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
plt.rcParams['figure.dpi']     = 200     # default = 72.0
plt.rcParams['font.size']      = 7.5     # default = 10.0

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, ShuffleSplit

from sklearn.utils import all_estimators
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import ARDRegression
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

import warnings
warnings.filterwarnings("ignore")

Dataset

First, I can import the dataset and show some lines to see what the features are and their format.

In [2]:
df = pd.read_csv("SeoulBikeData.csv")
df
Out[2]:
Date Rented Bike Count Hour Temperature(C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday Yes
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 30/11/2018 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn No Holiday Yes
8756 30/11/2018 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8757 30/11/2018 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8758 30/11/2018 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn No Holiday Yes
8759 30/11/2018 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn No Holiday Yes

8760 rows × 14 columns

We canc clearly see that out dataframe deserve some little changes to be more suitable to work with.

  • The columns names are too long, so I can change their names to be shorter without loosing to much information
  • The Holiday and Functionning Day columns are booleans but in a string format so I can change this to have a 0 or 1 format
  • The Date format is a string with dd/mm/yyyy format. Let's change it to standart format yyyy/mm/dd and date type
In [3]:
shorter_column_name = [
 'date',    # Date
 'n_bike',  # Rented Bike Count
 'hour',    # Hour
 'temp',    # Temperature (°C)
 'hum',     # Humidity (%)
 'wind',    # Wind speed (m/s)
 'visb',    # Visibility (10m)
 'dew',     # Dew point temperature (°C)
 'solar',   # Solar Radiation (MJ/m2)
 'rain',    # Rainfall(mm)
 'snow',    # Snowfall (cm)
 'season',  # Seasons
 'holiday', # Holiday
 'func'     # Functioning Day
]
df.columns = shorter_column_name
df['func'] = (df['func'] == "Yes").astype(int)
df['holiday'] = (df['holiday'] == "Holiday").astype(int)
df['date'] = pd.to_datetime(df['date'], format="%d/%m/%Y")
df
Out[3]:
date n_bike hour temp hum wind visb dew solar rain snow season holiday func
0 2017-12-01 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter 0 1
1 2017-12-01 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter 0 1
2 2017-12-01 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter 0 1
3 2017-12-01 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter 0 1
4 2017-12-01 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 2018-11-30 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn 0 1
8756 2018-11-30 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn 0 1
8757 2018-11-30 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn 0 1
8758 2018-11-30 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn 0 1
8759 2018-11-30 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn 0 1

8760 rows × 14 columns

Then, as we have only one categorical variable season (for now), we can display a short descriptions of the numerical features. Here we have :

  • n_bike : Corresponding to the number of bike rented in Seoul for a specific hour
  • (date, hour) : describe the time (unique) of the sample
  • func : describe if the rental if functionnal for the specific hour
  • the other feautures are meteorological description of Seoul at this time.

We will go deeper in these descriptions during the analysis.

In [4]:
df.describe()
Out[4]:
n_bike hour temp hum wind visb dew solar rain snow holiday func
count 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000
mean 704.602055 11.500000 12.882922 58.226256 1.724909 1436.825799 4.073813 0.569111 0.148687 0.075068 0.049315 0.966324
std 644.997468 6.922582 11.944825 20.362413 1.036300 608.298712 13.060369 0.868746 1.128193 0.436746 0.216537 0.180404
min 0.000000 0.000000 -17.800000 0.000000 0.000000 27.000000 -30.600000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 191.000000 5.750000 3.500000 42.000000 0.900000 940.000000 -4.700000 0.000000 0.000000 0.000000 0.000000 1.000000
50% 504.500000 11.500000 13.700000 57.000000 1.500000 1698.000000 5.100000 0.010000 0.000000 0.000000 0.000000 1.000000
75% 1065.250000 17.250000 22.500000 74.000000 2.300000 2000.000000 14.800000 0.930000 0.000000 0.000000 0.000000 1.000000
max 3556.000000 23.000000 39.400000 98.000000 7.400000 2000.000000 27.200000 3.520000 35.000000 8.800000 1.000000 1.000000

Analysis and preprocessing

Missing Values

Let's start our analysis dealing with missing values. To be sure the dataset is cleaned, I compute the number of missing values for each columns on the entire dataset.

In [5]:
df.isnull().sum().reset_index(name='NA_count').sort_values(by='NA_count', ascending=False)
Out[5]:
index NA_count
0 date 0
1 n_bike 0
2 hour 0
3 temp 0
4 hum 0
5 wind 0
6 visb 0
7 dew 0
8 solar 0
9 rain 0
10 snow 0
11 season 0
12 holiday 0
13 func 0

Surprisingly, there is no missing value (rare event). The dataset is very clean and homogeneous.

Functionning Day

Before to dive into the analysis, let's explore the number of rented bikes on hours according to whether or not this is a functionning day.

In [6]:
df.groupby(['func']).agg({'n_bike': ['mean', 'std', 'count']}).reset_index()
Out[6]:
func n_bike
mean std count
0 0 0.000000 0.000000 295
1 1 729.156999 642.351166 8465

We can see that the number of rented bike is 0 for hours when this is a non functionning day. In other words, we can not train a model on the features to predict ifthe number of rented bikes per hours based on the non functionning days. So I remove these samples from the dataset.

In [7]:
N = len(df)
print(f'Dataset number of samples before non-functionning days deletion: {len(df)}')
print(f'Number of non-functionning days: {len(df[df.func != 1])}')
df = df[df.func == 1].drop('func', axis=1)
print(f'Dataset number of samples after  non-functionning days deletion: {len(df)}')
Dataset number of samples before non-functionning days deletion: 8760
Number of non-functionning days: 295
Dataset number of samples after  non-functionning days deletion: 8465

Rented Bike Count

As I said before, the target will be the number of bike rented. More precisely, each sample gives the number of bike rented for each hours from 2017-12-01 to 2018-11-30. We can conclude it is a regression task to predict the number of bikes rented for a given hour based on the temporal and meteorologiical features the dataset contains.

For these hours, I want to see the distribution of the number of rented bikes.

In [8]:
N = len(df) // 100

plt.hist(x=df.n_bike, bins=N, color='#8800ff', alpha=0.5)
plt.title('Distribution of rented bike count per hour on fuctionning days')
plt.xlabel('n_bike')
plt.ylabel('count')
pass

TRANSFORMATION REMINDER >

Square Root

The square root method is typically used when your data is moderately skewed. Now using the square root is a transformation that has a moderate effect on distribution shape. It is generally used to reduce right skewed data. Finally, the square root can be applied on zero values and is most commonly used on counted data.

Logarithmic

The logarithmic is a strong transformation that has a major effect on distribution shape. This technique is, as the square root method, oftenly used for reducing right skewness. However, is that it can not be applied to zero or negative values.

Here we can see the distribution is right skewed. A square root or log transformation is needed to standize this distribution. I will choose a square root transformation because don't want a high effect because of the values from approximatively 500 to 1200. We see on this range 2 gaussian-like subdistribution.

(For the example I will plot both the distribution of log and sqrt distribution)

In [9]:
fig, axes = plt.subplots(1, 2)

# sqrt transformation
axes[0].hist(x=np.sqrt(df.n_bike), bins=N, color='#2f9599', alpha=0.5)
axes[0].set_title('Distribution of rented bike count per hour on fuctionning days')
axes[0].set_xlabel('sqrt(n_bike)')
axes[0].set_ylabel('count')

# log transformation
axes[1].hist(x=np.log(df.n_bike), bins=N, color='#999999', alpha=0.5)
axes[1].set_title('Distribution of rented bike count per hour on fuctionning days')
axes[1].set_xlabel('log(n_bike)')
axes[1].set_ylabel('count')
pass

We can conclude a sqrt is most suitable for the n_bike distribution to have a more gaussian-like distribution. Moreover, we can clearly see the 2 gaussian like distributions I'll spoke about before are leading to a left skewed distribution with a log-transform.

In [10]:
df['n_bike'] = np.sqrt(df.n_bike)

Temporal features

Is it a time serie ?

The second interesting part is that we have a kind of time serie dataset. High is the temptation to deal with this dataset with this way of thinking. As this dataset is about bike rental in time, I am not convinced this is a time serie question. Let's plot the number of bike rented for each hour over time.

In [11]:
df.plot(x='date', y='n_bike', color='#2f9599', alpha=0.05, kind='scatter', label='hour of rent')


plt.plot(df.date, df.n_bike.rolling(7*24, center=True).mean(), color='#8800ff', alpha=0.5, label='Average bike rented on centered week') 
plt.plot(df.date, df.n_bike.rolling(30*24, center=True).mean(), color='#8800ff', label='Average bike rented on centered month') 
plt.legend()
plt.title('Distribution of rented bike count per hour over time')
plt.xlabel('day')
plt.ylabel('sqrt(n_bike)')
pass

As I said, even looking at this plot infer the idea of a time serie. But if we look at the plot with a better precision and with the data knowledge, we can see the number rented bike is not really a time serie problem

  • It varies according to the seasons, monthes as we can see with the average bike rented on centered month
  • It varies according to the days of the week as we can see with the average bike rented on centered week

With the knowledge of what a bike rent is, we can clearly suppose the number of rent is not given by the previous rents (even if a trend can be seen, only one year data is not sufficient to interpolate this trend). The number of rent is given by the week days habits and the seasons, thats why our meteorological features will be helpfull.

First, we can extract more informations from the date and hour to confirm my supposition. I extract :

  • Year (int)
  • Month (str)
  • Day (int)
  • Week day (str)
  • Working day (bool) : wether or not is is a monday to friday day when Coreans work
In [12]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month_name()
df['day'] = df['date'].dt.day
df['week_day'] = df['date'].dt.day_name()
df['working_day'] = (df['date'].dt.dayofweek < 5).astype(np.int)

df = df.drop('date', axis=1)
time_cols = ['year', 'month', 'day', 'week_day', 'working_day', 'season', 'holiday']
df[time_cols + ['n_bike']]
Out[12]:
year month day week_day working_day season holiday n_bike
0 2017 December 1 Friday 1 Winter 0 15.937377
1 2017 December 1 Friday 1 Winter 0 14.282857
2 2017 December 1 Friday 1 Winter 0 13.152946
3 2017 December 1 Friday 1 Winter 0 10.344080
4 2017 December 1 Friday 1 Winter 0 8.831761
... ... ... ... ... ... ... ... ...
8755 2018 November 30 Friday 1 Autumn 0 31.670175
8756 2018 November 30 Friday 1 Autumn 0 27.640550
8757 2018 November 30 Friday 1 Autumn 0 26.343880
8758 2018 November 30 Friday 1 Autumn 0 26.683328
8759 2018 November 30 Friday 1 Autumn 0 24.166092

8465 rows × 8 columns

My objective through the following analysis is to prove (or not) the time features are good categorical features (and not numerical ones for time series)

First I have a doubt on the efficience for the day number of the current month to give informations on the number of rented bikes.

Day number in month

In [13]:
plt.scatter(df.day, df.n_bike, color='#2f9599', alpha=0.05, label='hour of rent')

_temp = df.groupby('day').agg({'n_bike': ['mean', 'std']}).reset_index()
mean = _temp.n_bike['mean']
std = _temp.n_bike['std']
plt.plot(_temp.day, mean, color='#8800ff', label='mean') 
plt.fill_between(_temp.day, mean - std, mean + std, color='#8800ff', alpha=0.1, label='1-std interval') 
plt.legend()
plt.title('Distribution of rented bike count on day number')
plt.xlabel('day')
plt.ylabel('sqrt(n_bike)')
pass

To be honest, I was hoping there was a trend to rent more bikes at the beginning of the month because the salary is recent. But no. To be at the begining, at the middle or at the end of the month seems to have almost zero influence on our target.

Also, the year is not usable as the range of time is 2017-12-01 to 2018-11-30 so an exactly one-year window.

And what about the months and seasons ?

Month

In this part, I will be interested to see if the month (categorical) is a good feature to predict the number of bikes rented for a given hour in a given day. To do so, I display some aggregates for each months and I plot the notch-boxplot of distribution for the relatives number of bike rented.

In [14]:
# group by month to get box plot
groups = df.groupby(by='month', sort=False).n_bike
labels = groups.count().reset_index(name='count').month.values
groups = [groups.get_group(k) for k in groups.groups]

# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i%len(colors)])

plt.title('Distribution of rented bike count according to the month')
plt.xlabel('month')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
df.groupby('month', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
Out[14]:
n_bike
count sum mean min max median std
month
December 744.0 11101.8 14.9 1.7 30.6 15.4 5.1
January 744.0 10031.6 13.5 4.2 26.5 13.5 4.5
February 672.0 9453.1 14.1 2.6 30.5 14.0 5.3
March 744.0 15507.9 20.8 1.4 45.8 21.4 8.8
April 696.0 17243.7 24.8 1.4 53.0 26.1 11.8
May 720.0 20488.5 28.5 2.4 57.0 30.6 13.1
June 720.0 23970.0 33.3 3.0 59.6 33.8 11.7
July 744.0 21824.1 29.3 3.0 56.5 28.8 11.3
August 744.0 20719.7 27.8 3.2 53.3 27.1 10.0
September 624.0 19237.2 30.8 4.8 57.4 31.9 11.4
October 665.0 19457.7 29.3 1.4 53.5 31.0 11.1
November 648.0 16328.9 25.2 3.5 48.5 26.4 9.2

Here we clearly see monthes has an impact on the number of bikes rented. Moreover we can suppose the season will be a good resume of this plot. Indeed we see for the winter monthes less bikes rented and inversely.

Seasons

In this part, I will show the same things I did for the month to see if the seasons are good features and a good summary of the month-result above.

In [15]:
# group by season to get box plot
groups = df.groupby(by=['season'], sort=False).n_bike
labels = groups.count().reset_index(name='count').season.values
groups = [groups.get_group(k) for k in groups.groups]

# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i%len(colors)])

plt.title('Distribution of rented bike count according to the season')
plt.xlabel('season')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
df.groupby('season', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
Out[15]:
n_bike
count sum mean min max median std
season
Winter 2160.0 30586.4 14.2 1.7 30.6 14.2 5.0
Spring 2160.0 53240.1 24.6 1.4 57.0 24.5 11.8
Summer 2208.0 66513.8 30.1 3.0 59.6 30.1 11.3
Autumn 1937.0 55023.8 28.4 1.4 57.4 29.3 10.8

As I said, during the winter months there are less bike rented.

Holiday

In this part, I will do the same thing as the month and season features but on the holiday one.

In [16]:
# group by holiday to get box plot
groups = df.groupby(by=['holiday'], sort=False).n_bike
labels = ['No Holiday', 'Holiday']
groups = [groups.get_group(k) for k in groups.groups]

# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i%len(colors)])

plt.title('Distribution of rented bike count according to the holiday')
plt.xlabel('holiday')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
df.groupby('holiday', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
Out[16]:
n_bike
count sum mean min max median std
holiday
0 8057.0 197273.6 24.5 1.4 59.6 23.7 11.8
1 408.0 8090.5 19.8 1.7 49.0 16.1 11.7

It is not as clear as for the seasons or the months but there is a trend for less bike rented during holidays. We almost can conclude these bikes are rented to go to work.

Hour

If my previous conclusion is valid, then, by plotting the mean number of rented bikes according to the hour we will see two peaks around 8h00 and 18h00.

In [17]:
_temp = df.groupby('hour').agg({'n_bike': ['mean', 'std']}).reset_index()
mean = _temp.n_bike['mean']
std = _temp.n_bike['std']
plt.plot(_temp.hour, mean, color='#8800ff', label='mean') 
plt.fill_between(_temp.hour, mean - std, mean + std, color='#8800ff', alpha=0.1, label='1-std interval') 
plt.legend()
plt.title('Distribution of rented bike count according to day hours')
plt.xlabel('hour')
plt.ylabel('sqrt(n_bike)')
pass

As I supposed, we can see these peaks. So I can make a better supposition. This distribution will be improved for working days (monday to friday in Corea) and we probably can see another distribution for the weekend days.

Week days

For this part, I will plot the distribution of the number of rented bikes according to the hour. I will do it splitting the curves :

  • day by day
  • for working days and weekend days
In [18]:
fig, axes = plt.subplots(1, 2)
# Weekdays plot
_temp = df.groupby(['week_day', 'hour']).agg({'n_bike': ['mean', 'std']}).reset_index()
colors = ['#80f', '#80f8', '#2f9599', '#2f959988', '#999', '#ccc', '#aaa']
for week_day, color in zip(_temp.week_day.unique(), colors):
    mean = _temp.n_bike[_temp.week_day == week_day]['mean']
    axes[0].plot(_temp[_temp.week_day == week_day].hour, mean, color=color, label=f'{week_day} - mean') 
axes[0].legend()
axes[0].set_title('Distribution of rented bike count according to day hours')
axes[0].set_xlabel('hour')
axes[0].set_ylabel('sqrt(n_bike)')

# working days plot
_temp = df.groupby(['working_day', 'hour']).agg({'n_bike': ['mean', 'std']}).reset_index()
colors = ['#8800ff', '#2f9599']
for working_day, color in zip(_temp.working_day.unique(), colors):
    mean = _temp.n_bike[_temp.working_day == working_day]['mean']
    std = _temp.n_bike[_temp.working_day == working_day]['std']
    axes[1].plot(_temp[_temp.working_day == working_day].hour, mean, color=color, label=f'{"Is" if working_day else "Is not"} working day- mean') 
    axes[1].fill_between(_temp[_temp.working_day == working_day].hour, mean - std, mean + std, color=color, alpha=0.05, label=f'{"Is" if working_day else "Is not"} working day- 1-std interval') 
axes[1].legend()
axes[1].set_title('Distribution of rented bike count according to day hours')
axes[1].set_xlabel('hour')
axes[1].set_ylabel('sqrt(n_bike)')

pass

Bingo ! The distribution between :

  • the working days and weekend days are really distinct
  • the different working days are almost the same
  • the different weekend days are almost the same even there are more bike rented on Saturday than Sunday

To conclude : I was right saying the time serie idea is not as good as we can suppose first. The months, seasons, holidays, hours etc. as categorical are features are far most better features than a continuous time.

Meteorological features

In this part I will explain what are the other features. Only remains the meteorological ones.

Temperature and Dew point temperature

The dew point (°C) or dew temperature (°C) is the temperature below which condensation naturally occurs. More technically, below this temperature which depends on the ambient pressure and humidity, the water vapor contained in the air condenses on surfaces, by saturation effect.

Concretely, the more the dew point and the temperature are high, the more the bike rider will feel confortable (asssuming natural Corean temperatures).

I will deal with these two features simultaneously as we can assume these temperatures are correlated. Of course, I will plot a first scatter point to validate this assumption. Then I will plot the temperatures distribution to see if it need some transformation. Then the last plot is showing the relation between the temperatures and the number of bikes rented (already sqrt-transformed).

In [19]:
plt.rcParams['figure.figsize'] = [12, 10] # default = [6.0, 4.0]
N = len(df) // 100

# tempature and dew point comparison
ax = plt.subplot2grid((2, 2), (0, 0), colspan=2)
df.plot(x='temp', y='dew', kind='scatter', color='#8800ff', alpha=0.05, ax=ax, label='Given hour')
# add regression line
x, y = df.temp, df.dew
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Temperature and Dew point temperature comparison')
ax.set_xlabel('Temperature')
ax.set_ylabel('Dew point temperature')
ax.legend()

# Temperatures distributions
ax = plt.subplot2grid((2, 2), (1, 0), colspan=1)
ax.hist(x=df.temp, bins=N, color='#8800ff', alpha=0.3, label='Temperature')
ax.hist(x=df.dew, bins=N, color='#2f9599', alpha=0.3, label='Dew point')
# add legends
ax.set_title('Distribution of the temperatures')
ax.set_xlabel('Temperatures')
ax.set_ylabel('Count')
ax.legend()

# Temperatures and rented bikes distribution
ax = plt.subplot2grid((2, 2), (1, 1), colspan=1)
df.plot(x='temp', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Temperature at a given hour', ax=ax)
df.plot(x='dew', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Dew point at a given hour', ax=ax)
# add regression lines
x, y = df.temp, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression of Temperature') 
x, y = df.dew, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression of Dew point') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the temperatures')
ax.set_xlabel('Temperatures')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
pass

The two tempatures seems correlated. So it is a good point for me I can compare them. Moreover, their distribution is near a gaussian one, so I let these two features without transformation. Finally, we can see the concrete argument I said before : the more the dew point and the temperature are high, the more the bike rider will feel confortable. It leads to more bike rented

Humidity

Relative air humidity (%) is a measure of the ratio between the water vapor content of the air and its maximum capacity to contain water vapor under these conditions.

The relative humidity of the ambient air influences the evaporation of sweat, and thus the cooling of the body. Too low a humidity level increases cooling and increases the efficiency of perspiration, while too high a humidity level limits cooling and thus increases the feeling of warmth.

I will plot the humidity distribution to see if it need some transformation. Then the last plot is showing the relation between the humidity and the number of bikes rented (already sqrt-transformed).

In [20]:
N = len(df) // 100

# Humidity distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.hum, bins=N, color='#8800ff', alpha=0.3, label='Humidity')
# add legends
ax.set_title('Distribution of the Humidity')
ax.set_xlabel('hum')
ax.set_ylabel('Count')
ax.legend()

# Humidity and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='hum', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Humidity at a given hour', ax=ax)
# add regression line
x, y = df.hum, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Humidity')
ax.set_xlabel('hum')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

The humidity distribution is near a gaussian one (except near 100%), so I also let it without transformation. And, we can see the concrete argument I said before : the more the humidity is low, the more the bike rider will feel confortable. It leads to more bike rented.

Wind speed

For the wind speed (m/sec.), nothing special to say except that there is no strong wind which is considered to be from 14 m/sec. I out dataset the max is only 7.4 m/sec. so I am not sure this features will have a strong influence.

Yet, let's plot the wind speed distribution to see if it need some transformation. Then the last plot is showing the relation between the wind speed and the number of bikes rented (already sqrt-transformed).

In [21]:
N = len(df) // 100

# Wind speed distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.wind, bins=N, color='#8800ff', alpha=0.3, label='Wind speed')
# add legends
ax.set_title('Distribution of the Wind speed')
ax.set_xlabel('wind')
ax.set_ylabel('Count')
ax.legend()

# Wind speed and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='wind', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Wind speed at a given hour', ax=ax)
# add regression line
x, y = df.wind, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Wind speed')
ax.set_xlabel('wind')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

I expected to see that the wind speed will lead to less bikes rented but it is the opposite. I am suprised of this result but I can suppose it was not depending only on the wind but with crossed features, or maybe the wind can be confortable for the rider in some mesure. I also see that the wind speed has a right skewed distribution. So I will apply a log-transform on this feature.

In [22]:
df['wind'] = np.log(df['wind'] + 1)
In [23]:
N = len(df) // 200

# Wind speed distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.wind, bins=N, color='#2f9599', alpha=0.3, label='Wind speed')
# add legends
ax.set_title('Distribution of the Wind speed')
ax.set_xlabel('log(wind)')
ax.set_ylabel('Count')
ax.legend()

# Wind speed and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='wind', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Wind speed at a given hour', ax=ax)
# add regression line
x, y = df.wind, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Wind speed')
ax.set_xlabel('log(wind)')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

Solar Radiation

As know, the solar radiance (W/m2) corresponds to a flow of radiation from the sun. In other words, the more there are radiation, the more we are exposed to the sun.

I plot the solar radiation distribution to see if it need some transformation. Then the other plot is showing the relation between the solar radiation and the number of bikes rented (already sqrt-transformed).

In [24]:
N = len(df) // 100

# Solar radiationdistributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.solar, bins=N, color='#8800ff', alpha=0.3, label='Solar radiation')
# add legends
ax.set_title('Distribution of the Solar radiation')
ax.set_xlabel('solar')
ax.set_ylabel('Count')
ax.legend()

# Solar radiation and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='solar', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Solar radiation at a given hour', ax=ax)
# add regression line
x, y = df.solar, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Solar radiation')
ax.set_xlabel('solar')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

Here three conclusions can be made :

  • The solar radiation is the more often ~zero, which means there is no sun
  • The more there are sun radiation, the more the bikers feel confortables so more bike rented.
  • (Hard to see because it is in small) The non~zero solar radiation is right skewed.

So I will apply a square root transformation for three reasons :

  • I want to increase the gap between ~zero solar radiation and the other values
  • I want to have a better distribution of non~zero values
  • I want to keep the ~zero values as ~zero values, which is not possible with the log transformation
In [25]:
df['solar'] = np.sqrt(df['solar'])
In [26]:
N = len(df) // 100

# Solar radiationdistributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.solar, bins=N, color='#2f9599', alpha=0.3, label='Solar radiation')
# add legends
ax.set_title('Distribution of the Solar radiation')
ax.set_xlabel('sqrt(solar)')
ax.set_ylabel('Count')
ax.legend()

# Solar radiation and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='solar', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Solar radiation at a given hour', ax=ax)
# add regression line
x, y = df[df.solar>0.05].solar, df[df.solar>0.05].n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression (sqrt(solar) > 0.05)') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Solar radiation')
ax.set_xlabel('sqrt(solar)')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

Rainfall

The rainfall correspond (mm) to the rain fallen during the current hour. The more there are rain, the riskier riding is.

The rain fall is given by the rain fallen for a given our. So estimating the rain have a effect on dryness for 2 hours I will apply moving average of rainfall for the last 2 hours.

As I did for the others meteorological features, I plot the rainfall distribution to see if it need some transformation. Then the other plot is showing the relation between the rainfall and the number of bikes rented (already sqrt-transformed).

In [27]:
df['rain'] = df.rolling(2, min_periods=1)['rain'].mean()
In [28]:
N = len(df) // 100

# Rainfall distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.rain, bins=N, color='#8800ff', alpha=0.3, label='Rainfall')
# add legends
ax.set_title('Distribution of the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('Count')
ax.legend()

# Rainfall and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='rain', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Rainfall at a given hour', ax=ax)
# add regression line
x, y = df.rain, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

The plot showing the regression line is clearly indicates the more is it rainy, the less the bikes are used. But I can't clearly see the distribution when there are rain. Let's filter our plots on rainy hours.

In [29]:
_df = df.copy()
_df = _df[_df.rain > 0]
N = len(_df) // 100

# Rainfall distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=_df.rain, bins=N, color='#8800ff', alpha=0.3, label='Rainfall')
# add legends
ax.set_title('Distribution of the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('Count')
ax.legend()

# Rainfall and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
_df.plot(x='rain', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Rainfall at a given hour', ax=ax)
# add regression line
x, y = _df.rain, _df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

Here we can see, the rainy hours are really really right skewed. Moreover, we clearly see the inverse proportionnality with the number of rented bikes. So we will try another feature dryness (1/mm) given by $$dryness = \frac{1}{rain + 1}$$ avoiding ZeroDivisionError.

In [30]:
df['dryness'] = 1 / (df.rain + 1)
df = df.drop('rain', axis=1)
In [31]:
N = len(df[df.dryness < 1]) // 100

# Dryness distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df[df.dryness < 1].dryness, bins=N, color='#2f9599', alpha=0.3, label='Dryness')
# add legends
ax.set_title('Distribution of the Dryness (dryness < 1)')
ax.set_xlabel('dryness')
ax.set_ylabel('Count')
ax.legend()

# Dryness and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='dryness', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Dryness at a given hour', ax=ax)
# add regression line
x, y = df[df.dryness < 1].dryness, df[df.dryness < 1].n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression (dryness < 1)') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the dryness')
ax.set_xlabel('dryness')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

It is a really interpretative transformation. I will keep it as :

  • The dryness distribution (dryness < 1) is gaussian-like
  • The drier, the more the bikes are rented.

Snowfall

The snowfall correspond (mm) to the rain fallen during the current hour. The more there are rain, the riskier riding is.

The snow fall is given by the snow fallen for a given our. So estimating the snow have a effect on dryness for 8 hours in the city of Seoul I will apply moving average of snowfall for the last 8 hours.

I plot the snowfall distribution to see if it need some transformation. Then the other plot is showing the relation between the snowfall and the number of bikes rented (already sqrt-transformed).

In [32]:
df['snow'] = df.rolling(8, min_periods=1)['snow'].mean()
In [33]:
N = len(df) // 100

# Snowfall distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.snow, bins=N, color='#8800ff', alpha=0.3, label='Snowfall')
# add legends
ax.set_title('Distribution of the Snowfall')
ax.set_xlabel('snow')
ax.set_ylabel('Count')
ax.legend()

# Snowfall and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='snow', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Snowfall at a given hour', ax=ax)
# add regression line
x, y = df.snow, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Snowfall')
ax.set_xlabel('snow')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

To be honest, I was thinking the rainfall and the snowfall will be the same. Of course, at first glance, we can conclude the snower, the more the bike will be rented. Yet looking more precisely, we see it is more a categorical variable. If there are 1 or 10 mm of snow, the number of bike rented seems to follow the same distribution. In other words, snow is like disqualifying. So I choose to transform this variable into a boolean one and see the results on a notch box plot

In [34]:
_df = df.copy()
_df['snowing'] = (_df.snow > 0).astype(np.int)

# group by snowing to get box plot
groups = _df.groupby(by=['snowing'], sort=False).n_bike
labels = ['snowing', 'Not snowing']
groups = [groups.get_group(k) for k in groups.groups]

# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i%len(colors)])

plt.title('Distribution of rented bike count according to the snowing')
plt.xlabel('snowing')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
_df.groupby('snowing', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
Out[34]:
n_bike
count sum mean min max median std
snowing
0 289.0 4628.3 16.0 3.6 30.6 16.2 5.2
1 8176.0 200735.9 24.6 1.4 59.6 23.9 11.9

Indeed, we can see the distribution of number of bike rented is restricted. So the feature snowing is a great categorical features for snowy hours.

In [35]:
df['snowing'] = (df.snow > 0).astype(np.int)
df = df.drop('snow', axis=1)

Visibility

Only remains the visibility (10m) feature, which correspond to the visibility at horizon for human eye (default 2000)

I plot the visibility distribution to see if it need some transformation. Then the other plot is showing the relation between the visibility and the number of bikes rented (already sqrt-transformed).

In [36]:
N = len(df) // 100

# Visibility distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.visb, bins=N, color='#8800ff', alpha=0.3, label='Visibility')
# add legends
ax.set_title('Distribution of the Visibility')
ax.set_xlabel('visb')
ax.set_ylabel('Count')
ax.legend()

# Visibility and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='visb', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Visibility at a given hour', ax=ax)
# add regression line
x, y = df.visb, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Visibility')
ax.set_xlabel('visb')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

As we can see, the visb is by default at 2000 which is weird to interpret as a maximum visibility. So I decide to change the visibility feature to invisb which reverts the distribution. It will be interpreted as the removed (reduced) visibility (10m)

In [37]:
df['invisb'] = df.visb.max() - df.visb
df = df.drop('visb', axis=1)
In [38]:
N = len(df) // 100

# reduced visibility distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.invisb, bins=N, color='#2f9599', alpha=0.3, label='reduced visibility')
# add legends
ax.set_title('Distribution of the reduced visibility')
ax.set_xlabel('invisb')
ax.set_ylabel('Count')
ax.legend()

# reduced visibility and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='invisb', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='reduced visibility at a given hour', ax=ax)
# add regression line
x, y = df.invisb, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression') 
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the reduced visibility')
ax.set_xlabel('invisb')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass

Now it is easier to interpret. The more the visibility is reduced, the less the bikes are rented.

Dataset preprocessing completion

Before completing the preprocessing of the dataset, I switch the variable containing this data from df to data and we can have a simple look at this data as a reminder.

In [39]:
data = df
data
Out[39]:
n_bike hour temp hum wind dew solar season holiday year month day week_day working_day dryness snowing invisb
0 15.937377 0 -5.2 37 1.163151 -17.6 0.0 Winter 0 2017 December 1 Friday 1 1.0 0 0
1 14.282857 1 -5.5 38 0.587787 -17.6 0.0 Winter 0 2017 December 1 Friday 1 1.0 0 0
2 13.152946 2 -6.0 39 0.693147 -17.7 0.0 Winter 0 2017 December 1 Friday 1 1.0 0 0
3 10.344080 3 -6.2 40 0.641854 -17.6 0.0 Winter 0 2017 December 1 Friday 1 1.0 0 0
4 8.831761 4 -6.0 36 1.193922 -18.6 0.0 Winter 0 2017 December 1 Friday 1 1.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 31.670175 19 4.2 34 1.280934 -10.3 0.0 Autumn 0 2018 November 30 Friday 1 1.0 1 106
8756 27.640550 20 3.4 37 1.193922 -9.9 0.0 Autumn 0 2018 November 30 Friday 1 1.0 1 0
8757 26.343880 21 2.6 39 0.262364 -9.9 0.0 Autumn 0 2018 November 30 Friday 1 1.0 1 32
8758 26.683328 22 2.1 41 0.693147 -9.8 0.0 Autumn 0 2018 November 30 Friday 1 1.0 1 141
8759 24.166092 23 1.9 43 0.832909 -9.3 0.0 Autumn 0 2018 November 30 Friday 1 1.0 1 91

8465 rows × 17 columns

Categorical features : One hot encoding

Let's remember that holiday, working_day, snowing are already one hot encoded as they are boolean feautres. Only remains :

  • hour
  • season
  • month
  • day
  • week_day

Important note : I save every dummies in a dummies.json file

In [40]:
categorical_columns = ['hour', 'season', 'month', 'day', 'week_day']
dummies = {}
for column in categorical_columns:
    unique = [str(x) for x in list(data[column].unique())]
    dummies[column] = unique
    
# save file
with open('dummies.json', 'w') as f:
    json.dump(dummies, f, indent=4)
In [41]:
def dummies_from_file(df, file_name):
    _df = df.copy()
    with open(file_name) as f:
         dummies = json.load(f)


    for column, values in dummies.items():
        for value in values:
            _df[f'{column}_{value}'] = (_df[column].astype(str) == value).astype(int)
        _df = _df.drop(column, axis=1)
    return _df
    
data = dummies_from_file(data, 'dummies.json')
data
Out[41]:
n_bike temp hum wind dew solar holiday year working_day dryness ... day_29 day_30 day_31 week_day_Friday week_day_Saturday week_day_Sunday week_day_Monday week_day_Tuesday week_day_Wednesday week_day_Thursday
0 15.937377 -5.2 37 1.163151 -17.6 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
1 14.282857 -5.5 38 0.587787 -17.6 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
2 13.152946 -6.0 39 0.693147 -17.7 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
3 10.344080 -6.2 40 0.641854 -17.6 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
4 8.831761 -6.0 36 1.193922 -18.6 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 31.670175 4.2 34 1.280934 -10.3 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8756 27.640550 3.4 37 1.193922 -9.9 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8757 26.343880 2.6 39 0.262364 -9.9 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8758 26.683328 2.1 41 0.693147 -9.8 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8759 24.166092 1.9 43 0.832909 -9.3 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0

8465 rows × 90 columns

Numerical features : Normalization and Standardization

Now we have all our categorical features ready to use. Let's plot the histogram and describe the numerical ones as a reminder.

In [42]:
N = len(data) // 100
numerical_columns = ['n_bike', 'temp', 'hum', 'wind', 'dew', 'solar', 'dryness']

plt.rcParams['figure.figsize'] = [12, 6] # default = [6.0, 4.0]
fig, axes = plt.subplots(2, 4)
axes = axes.flatten()
axes[-1].remove()
for column, ax in zip(numerical_columns, axes):
    ax.hist(x=data[column], bins=N, color='#8800ff', alpha=0.3)
    # add legends
    ax.set_title(f'Distribution of {column}')

plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
    
data[numerical_columns].describe()
Out[42]:
n_bike temp hum wind dew solar dryness
count 8465.000000 8465.000000 8465.000000 8465.000000 8465.000000 8465.000000 8465.000000
mean 24.260382 12.771057 58.147194 0.933634 3.944997 0.478287 0.962852
std 11.857802 12.104375 20.484839 0.372602 13.242399 0.582365 0.144701
min 1.414214 -17.800000 0.000000 0.000000 -30.600000 0.000000 0.042553
25% 14.628739 3.000000 42.000000 0.641854 -5.100000 0.000000 1.000000
50% 23.280893 13.500000 57.000000 0.916291 4.700000 0.100000 1.000000
75% 32.924155 22.700000 74.000000 1.193922 15.200000 0.964365 1.000000
max 59.632206 39.400000 98.000000 2.128232 27.200000 1.876166 1.000000

The humidity, and the dryness are in a given interval ([0, 100] and [0, 1]) so I will normalize these features. I will also normalize the solar feature as I want to keep the 0 value the same.

Remains the number of bike, the temperature, the dew point and the wind I will standardize. Particular point for n_bike : I put the mean and std in variables to unfit it later on predictions. And I saved the transformation applied in a transformations.json file

In [43]:
MEAN_LOG_N_BIKE = data.n_bike.mean()
STD_LOG_N_BIKE = data.n_bike.std()
print(f'MEAN_LOG_N_BIKE: {MEAN_LOG_N_BIKE:.2f}')
print(f'STD_LOG_N_BIKE: {STD_LOG_N_BIKE:.2f}')

transformations = {}
# standardization
for column in ['n_bike', 'temp', 'wind', 'dew']:
    mean, std = data[column].mean(), data[column].std()
    transformations[column] = {'function' : 'standardization', 'mean': float(mean), 'std': float(std)}

    
# Normalization
for column in ['hum', 'solar', 'dryness']:
    maxi = data[column].max()
    transformations[column] = {'function' : 'normalization', 'maxi': float(maxi)}

# save file
with open('transformations.json', 'w') as f:
    json.dump(transformations, f, indent=4)
MEAN_LOG_N_BIKE: 24.26
STD_LOG_N_BIKE: 11.86
In [44]:
def norm_from_file(df, file_name):
    _df = df.copy()
    with open(file_name) as f:
         transformations = json.load(f)


    for column, t in transformations.items():
        if t['function'] == 'standardization':
            _df[column] = (_df[column] - t['mean']) / t['std']
        elif t['function'] == 'normalization':
            _df[column] = _df[column] / t['maxi']

    return _df

data = norm_from_file(data, 'transformations.json')
data
Out[44]:
n_bike temp hum wind dew solar holiday year working_day dryness ... day_29 day_30 day_31 week_day_Friday week_day_Saturday week_day_Sunday week_day_Monday week_day_Tuesday week_day_Wednesday week_day_Thursday
0 -0.701901 -1.484675 0.377551 0.615985 -1.626971 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
1 -0.841431 -1.509459 0.387755 -0.928195 -1.626971 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
2 -0.936720 -1.550766 0.397959 -0.645425 -1.634522 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
3 -1.173599 -1.567289 0.408163 -0.783088 -1.626971 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
4 -1.301137 -1.550766 0.367347 0.698571 -1.702486 0.0 0 2017 1 1.0 ... 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 0.624888 -0.708096 0.346939 0.932095 -1.075711 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8756 0.285059 -0.774188 0.377551 0.698571 -1.045505 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8757 0.175707 -0.840279 0.397959 -1.801574 -1.045505 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8758 0.204333 -0.881587 0.418367 -0.645425 -1.037954 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0
8759 -0.007952 -0.898110 0.438776 -0.270328 -1.000196 0.0 0 2018 1 1.0 ... 0 1 0 1 0 0 0 0 0 0

8465 rows × 90 columns

In [45]:
N = len(data) // 100
numerical_columns = ['n_bike', 'temp', 'hum', 'wind', 'dew', 'solar', 'dryness']

plt.rcParams['figure.figsize'] = [12, 6] # default = [6.0, 4.0]
fig, axes = plt.subplots(2, 4)
axes = axes.flatten()
axes[-1].remove()
for column, ax in zip(numerical_columns, axes):
    ax.hist(x=data[column], bins=N, color='#2f9599', alpha=0.3)
    # add legends
    ax.set_title(f'Distribution of {column}')

plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
    
data[numerical_columns].describe()
Out[45]:
n_bike temp hum wind dew solar dryness
count 8.465000e+03 8.465000e+03 8465.000000 8.465000e+03 8.465000e+03 8465.000000 8465.000000
mean 2.052306e-16 5.372089e-17 0.593339 6.043600e-17 -7.722378e-17 0.254928 0.962852
std 1.000000e+00 1.000000e+00 0.209029 1.000000e+00 1.000000e+00 0.310402 0.144701
min -1.926678e+00 -2.525621e+00 0.000000 -2.505715e+00 -2.608666e+00 0.000000 0.042553
25% -8.122621e-01 -8.072335e-01 0.428571 -7.830879e-01 -6.830331e-01 0.000000 1.000000
50% -8.260290e-02 6.022143e-02 0.581633 -4.654579e-02 5.701406e-02 0.053300 1.000000
75% 7.306391e-01 8.202772e-01 0.755102 6.985709e-01 8.499218e-01 0.514008 1.000000
max 2.983000e+00 2.199944e+00 1.000000 3.206099e+00 1.756102e+00 1.000000 1.000000

Train-test split

In [46]:
X, y = data.drop('n_bike', axis=1), data['n_bike']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')
X_train shape: (5925, 89)
y_train shape: (5925,)
X_test shape: (2540, 89)
y_test shape: (2540,)

Bonus : dataset without any preprocessing

To have a idea on how the preprocessing is important I think it is fun to create a train-test split of the dataset without preprocessing. Of course if I am honest, I will at least make a one hot encoding of categorical variables days, remove the non functionning days and date will be timestamped to be considered as a time serie.

In [47]:
__data = pd.read_csv("SeoulBikeData.csv")
__data.columns = shorter_column_name

__data['func'] = (__data['func'] == "Yes").astype(int)
__data = __data[__data.func == 1].drop('func', axis=1)
__data['holiday'] = (__data['holiday'] == "Holiday").astype(int)
__data['date'] = pd.to_datetime(__data['date'], format="%d/%m/%Y").astype('int64') // 10 ** 9
__data = pd.get_dummies(__data, prefix=['season'], columns=['season'])
__data
Out[47]:
date n_bike hour temp hum wind visb dew solar rain snow holiday season_Autumn season_Spring season_Summer season_Winter
0 1512086400 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 0 0 0 0 1
1 1512086400 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 0 0 0 0 1
2 1512086400 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 0 0 0 0 1
3 1512086400 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 0 0 0 0 1
4 1512086400 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 1543536000 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 0 1 0 0 0
8756 1543536000 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 0 1 0 0 0
8757 1543536000 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 0 1 0 0 0
8758 1543536000 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 0 1 0 0 0
8759 1543536000 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 0 1 0 0 0

8465 rows × 16 columns

In [48]:
__X, __y = __data.drop('n_bike', axis=1), __data['n_bike']
__X_train, __X_test, __y_train, __y_test = train_test_split(__X, __y, test_size=0.30, random_state=42)
print(f'__X_train shape: {__X_train.shape}')
print(f'__y_train shape: {__y_train.shape}')
print(f'__X_test shape: {__X_test.shape}')
print(f'__y_test shape: {__y_test.shape}')
__X_train shape: (5925, 15)
__y_train shape: (5925,)
__X_test shape: (2540, 15)
__y_test shape: (2540,)
In [49]:
plt.scatter(__X_train.date, __y_train, color='#8800ff', alpha=0.05, label='Training sample')
plt.scatter(__X_test.date, __y_test, color='#2f9599', alpha=0.1, label='Testing sample')
plt.legend()
plt.title('Distribution of sample over time')
plt.xlabel('timestamp')
plt.ylabel('n_bike')
pass

Machine Learning

As I already said, we have a regression problem. I need to test some regression models. To do so, let's display the models sklearn give us.

In [50]:
all_regs = np.array([name for name, RegressorClass in all_estimators(type_filter='regressor')])
for i in range(0, len(all_regs) // 3, 3):
    j = min(i + 3, len(all_regs))
    print('| '.join([f'{reg:30}' for reg in all_regs[i:j]]))
ARDRegression                 | AdaBoostRegressor             | BaggingRegressor              
BayesianRidge                 | CCA                           | DecisionTreeRegressor         
DummyRegressor                | ElasticNet                    | ElasticNetCV                  
ExtraTreeRegressor            | ExtraTreesRegressor           | GammaRegressor                
GaussianProcessRegressor      | GradientBoostingRegressor     | HistGradientBoostingRegressor 
HuberRegressor                | IsotonicRegression            | KNeighborsRegressor           

There are A LOT of models. I decide to choose some of these models :

  • LinearRegression
  • ElasticNet
  • HuberRegressor
  • BayesianRidge
  • ARDRegression
  • KNeighborsRegressor
  • RandomForestRegressor

IMPORTANT: For each model (except the Baseline of course), the cells will be the exact same except for the first one where the grid search hyperparameters are defined. It leads to an homogeneous way to compare these models. So read the code once.

Useful functions

Before diving into model grid searching and fitting, let's talk about my objectives. For each model :

  • I will perform a grid search on hyperparameters(if needed) on the transformed dataset
  • I will plot the result of the best hyperparameters for the model
  • Then I will do the same process on the untransformed dataset
  • I will compute the RMSE on the prediction for the two datasets to compare with other models later
  • To finish, I will compare the relative error of the prediction on the test set.

Compute the RMSE error

To compute RMSE I need a custom method to return the loss value and unfit the transformed target if needed

In [51]:
def target_unfit(y):
    y = (y * STD_LOG_N_BIKE + MEAN_LOG_N_BIKE) ** 2
    return y

def RMSE(pred, y, unfit=True):
    if unfit:
        pred = target_unfit(pred)
        y = target_unfit(y)
    error = pred - y
    rsme_error = np.sqrt((error ** 2).mean())
    return rsme_error

Plot error distribution

Then I need a method to compute the relative error distribution to have an idea of the error made by the models. (I try this method with random false data)

In [52]:
N = 5000
x = np.random.random(N) 
y = x ** 2 * np.random.random(N) + x * np.random.random(N)
In [53]:
def relative_error_distrib_plot(pred, y, model_name, unfit=True, color='#8800ff', alpha=0.5, label='Sample', ax=None):
    if unfit:
        pred = target_unfit(pred)
        y = target_unfit(y)
    error = pred - y
    if ax is not None:
        ax.hist(error, bins=len(y)//100, color=color, alpha=alpha, label=label)
        ax.legend()
        ax.set_title(f'Relative error distribution with {model_name}')
        ax.set_xlabel('Relative error')
        ax.set_ylabel('Count')
        
    else:
        plt.hist(error, bins=len(y)//100, color=color, alpha=alpha, label=label)
        plt.legend()
        plt.title(f'Relative error distribution with {model_name}')
        plt.xlabel('Relative error')
        plt.ylabel('Count')

relative_error_distrib_plot(x, y, 'no model', unfit=False, label='False sample')

And I also need a method to plot the same relative error according to the target values (Also tested on the false random data)

In [54]:
def relative_error_plot(pred, y, model_name, unfit=True, color='#8800ff', alpha=0.05, label='Sample', ax=None):
    if unfit:
        pred = target_unfit(pred)
        y = target_unfit(y)
    error = pred - y
    
    # add regression lines
    m, b = np.polyfit(y, error, 1)
    
    if ax is not None:
        ax.scatter(y, error, color=color, alpha=alpha, label=label)
        ax.plot(y, m * y + b, color='#fff', linewidth=2) 
        ax.plot(y, m * y + b, color=color, label=f'Linear regression of {label}') 

        ax.legend()
        ax.set_title(f'Relative error with {model_name}')
        ax.set_xlabel('n_bike')
        ax.set_ylabel('Relative error')
    else:
        plt.scatter(y, error, color=color, alpha=alpha, label=label)
        plt.plot(y, m * y + b, color='#fff', linewidth=3) 
        plt.plot(y, m * y + b, color=color, label=f'Linear regression of {label}') 
        plt.legend()
        plt.title(f'Relative error with {model_name}')
        plt.xlabel('n_bike')
        plt.ylabel('Relative error')

relative_error_plot(x, y, 'no model', unfit=False, label='False sample')

Last but ot least, I need a method to perform the grid search homogeneously to be able to honestly compare the following models. It performs a cross-validation shuffling the training set into 4 parts. Also, a random seed is set to make my experiments repoductible.

In [55]:
def grid_search(X, y, model, param_grid={}):
    rkwargs = {'random_state':42}
    try:
        model(**rkwargs)
    except:
        rkwargs = {}
        
    grid = GridSearchCV(
        model(**rkwargs),
        param_grid=param_grid,
        cv=ShuffleSplit(n_splits=4, random_state=42),
        verbose=0,
        n_jobs=-1
    )

    grid.fit(X, y)
    return grid

Baseline

My first model is not a model made by sklearn. It is always a good idea to have a baseline of the error without any model. In other words, I take the mean target for the two datasets and I compute the error on the respectives test sets.

In [56]:
pred = y_train.mean()
rmse = RMSE(pred, y_test, unfit=True)

__pred = __y_train.mean()
__rmse = RMSE(__y_train.mean(), __y_test, unfit=False)


RESULTS = pd.DataFrame.from_dict(data={'model': ['Baseline'], 'is_preprocess':[1], 'RMSE':[rmse]})
RESULTS = RESULTS.append({'model': 'Baseline', 'is_preprocess':0, 'RMSE':__rmse}, ignore_index=True)
RESULTS
Out[56]:
model is_preprocess RMSE
0 Baseline 1 646.328187
1 Baseline 0 633.377438

We can see our preprocessing on the target (sqrt) gives us a poorest baseline on the target. Nothing to worry, on the contrary, it is logical to have a worst prediction as the mean is driven by the extrems. So with our sqrt transform, it leads to underestimate the high extremes.

In [57]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]

relative_error_distrib_plot(pred, y_test, 'Baseline', label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, 'Baseline', unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, 'Baseline', label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, 'Baseline', unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)

pass

Of course we can see the predictions for with Baseline model are better on the dataset without transformation. Moreover we know the Baseline model on the transformed dataset is underestimated the number of bike rented.

LinearRegression

Model Name: LinearRegression
Description: Ordinary Least Squares algorithm
Prevents Overfitting: no
Handles Outliers: no
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 5 / 5
When to Use: Highly interpretable, no introduced bias
When to Use Expanded:
$\qquad$- Data consists of few outliers
$\qquad$- Little variance between output labels
$\qquad$- All of the input features are not only independent but also are not correlated.
Advantages:
$\qquad$- Easy to interpret results
$\qquad$- Low complexity level
Disadvantages:
$\qquad$- At risk of multicolinearity if input features are correlated
$\qquad$- Small errors/outliers in target values can drastically impact model
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None

Grid Search:

fit_intercept: bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).

normalize: bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.

In [58]:
model = LinearRegression
model_name = type(model()).__name__

param_grid = {
    'fit_intercept': [True, False],
    'normalize': [True, False]
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

Fit on transformed and untransformed datasets

In [59]:
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 436 ms, sys: 403 ms, total: 838 ms
Wall time: 1.94 s
Out[59]:
param_fit_intercept param_normalize params mean_test_score std_test_score rank_test_score
2 False True {'fit_intercept': False, 'normalize': True} 0.796253 0.010986 1
3 False False {'fit_intercept': False, 'normalize': False} 0.796253 0.010986 1
1 True False {'fit_intercept': True, 'normalize': False} 0.796253 0.010987 3
0 True True {'fit_intercept': True, 'normalize': True} 0.795179 0.010262 4
In [60]:
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 202 ms, sys: 182 ms, total: 384 ms
Wall time: 110 ms
Out[60]:
param_fit_intercept param_normalize params mean_test_score std_test_score rank_test_score
1 True False {'fit_intercept': True, 'normalize': False} 0.543789 0.012893 1
0 True True {'fit_intercept': True, 'normalize': True} 0.543789 0.012893 2
2 False True {'fit_intercept': False, 'normalize': True} 0.543789 0.012893 3
3 False False {'fit_intercept': False, 'normalize': False} 0.543789 0.012893 3

Relative errors on test sets

In [61]:
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
In [62]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass

Here we have our first prove of the transformation's effience. For this simple linear model we have a clearly better prediction with the transformed dataset. It proves the new featuresare individually meaningful compared to the raw ones. So we can try the ElasticNet which can be useful when there are some correlation between input features and to avoid overfitting.

Saving

In [63]:
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)

params = grid.best_params_
__params = __grid.best_params_

RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 1, 'RMSE':rmse, **params}, ignore_index=True)
RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 0, 'RMSE':__rmse, **__params}, ignore_index=True)
RESULTS
Out[63]:
model is_preprocess RMSE fit_intercept normalize
0 Baseline 1 646.328187 NaN NaN
1 Baseline 0 633.377438 NaN NaN
2 LinearRegression 1 294.091808 0.0 1.0
3 LinearRegression 0 422.606908 1.0 0.0

ElasticNet

Model Name: ElasticNet
Description: Ordinary Least Squares with both an L1 and L2 regularization term. The weights of the L1 vs. L2 regularization terms are controlled by an l1_ratio parameter.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: yes
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 3 / 5
When to Use: Blend Ridge and Lasso
When to Use Expanded:
$\qquad$- Data consists of few outliers
$\qquad$- May be some correlation between input features
$\qquad$- Avoid overfitting
$\qquad$- Feature selection
Advantages:
$\qquad$- Incorporate the feature selection abilities of Lasso with the regularization abilities of Ridge.
Disadvantages:
$\qquad$- In lowering variance, incorporates a degree of bias into the model.
$\qquad$- Can be difficult to tune alpha to attain a desirable balance between OLS and regularization terms
$\qquad$- Higher computational cost than Ridge or Lasso
Sklearn Package: linear
Required Args: None
Helpful Args: alpha (controls strength of regularization terms) l1_ration (controls ratio between L1 and L2 regularization terms)
Variations: ElasticNetCV MultiTaskElasticNet

Grid Search:

alpha: float, default=1.0

Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact >mathematical meaning of this parameter. alpha = 0 is equivalent to an ordinary least square, >solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso >object is not advised. Given this, you should use the LinearRegression object.
l1_ratio: float, default=0.5

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

In [64]:
model = ElasticNet
model_name = type(model()).__name__


param_grid = {
    'alpha': [0.1, 0.25, 0.5, 0.75, 0.9, 1],
    'l1_ratio': [0, 0.2, 0.5, 0.7, 1], 
    'max_iter': [10000]
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

Fit on transformed and untransformed datasets

In [65]:
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 34 s, sys: 30 s, total: 1min 3s
Wall time: 1min 34s
Out[65]:
param_alpha param_l1_ratio param_max_iter params mean_test_score std_test_score rank_test_score
0 0.1 0 10000 {'alpha': 0.1, 'l1_ratio': 0, 'max_iter': 10000} 0.627453 0.011237 1
5 0.25 0 10000 {'alpha': 0.25, 'l1_ratio': 0, 'max_iter': 10000} 0.528692 0.014287 2
1 0.1 0.2 10000 {'alpha': 0.1, 'l1_ratio': 0.2, 'max_iter': 10... 0.513821 0.012884 3
10 0.5 0 10000 {'alpha': 0.5, 'l1_ratio': 0, 'max_iter': 10000} 0.460413 0.015214 4
15 0.75 0 10000 {'alpha': 0.75, 'l1_ratio': 0, 'max_iter': 10000} 0.422454 0.015185 5
In [66]:
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 1.02 s, sys: 609 ms, total: 1.63 s
Wall time: 9.56 s
Out[66]:
param_alpha param_l1_ratio param_max_iter params mean_test_score std_test_score rank_test_score
19 0.75 1 10000 {'alpha': 0.75, 'l1_ratio': 1, 'max_iter': 10000} 0.544212 0.015101 1
24 0.9 1 10000 {'alpha': 0.9, 'l1_ratio': 1, 'max_iter': 10000} 0.544202 0.015544 2
29 1 1 10000 {'alpha': 1, 'l1_ratio': 1, 'max_iter': 10000} 0.544179 0.015839 3
14 0.5 1 10000 {'alpha': 0.5, 'l1_ratio': 1, 'max_iter': 10000} 0.544157 0.014364 4
9 0.25 1 10000 {'alpha': 0.25, 'l1_ratio': 1, 'max_iter': 10000} 0.544016 0.013628 5

Relative errors on test sets

In [67]:
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
In [68]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass

In this case, the model on untransformed dataset is better. It is because data consists of fewer outliers compare to the transformed dataset due to the high number of features. I am tented to try the HuberRegressor, which is good for quick analyses of data ignoring outliers even if it can break down with large numbers of input features.

Saving

In [69]:
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)

params = grid.best_params_
__params = __grid.best_params_

RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 1, 'RMSE':rmse, **params}, ignore_index=True)
RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 0, 'RMSE':__rmse, **__params}, ignore_index=True)
RESULTS
Out[69]:
model is_preprocess RMSE fit_intercept normalize alpha l1_ratio max_iter
0 Baseline 1 646.328187 NaN NaN NaN NaN NaN
1 Baseline 0 633.377438 NaN NaN NaN NaN NaN
2 LinearRegression 1 294.091808 0.0 1.0 NaN NaN NaN
3 LinearRegression 0 422.606908 1.0 0.0 NaN NaN NaN
4 ElasticNet 1 420.322384 NaN NaN 0.10 0.0 10000.0
5 ElasticNet 0 422.842630 NaN NaN 0.75 1.0 10000.0

HuberRegressor

Model Name: HuberRegressor
Description: A linear model designed to deal with outliers in the data and/or corrupted data. Does not ignore the outliers, but rather gives them a lower weight.
Prevents Overfitting: no
Handles Outliers: yes
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 4 / 5
When to Use: Outliers and want quickest algorithm
When to Use Expanded:
$\qquad$- Want quick analyses of data ignoring outliers
Advantages: '
$\qquad$- Faster than RANSAC and TheilSen (as long as the number of samples is not too large)
$\qquad$- Does not completely ignore data points it deems as outliers
Disadvantages:
$\qquad$- Break down with large numbers of input features
Sklearn Package: linear
Required Args: None
Helpful Args: max_iter (maximum number of iterations to perform)
Variations: None

Grid Search:

epsilon: float, greater than 1.0, default 1.35 The parameter epsilon controls the number of samples that should be classified as outliers. The smaller the epsilon, the more robust it is to outliers.

In [70]:
model = HuberRegressor
model_name = type(model()).__name__


param_grid = {
    'epsilon': [1, 1.2, 1.35, 1.5, 2, 3, 5, 10],
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

Fit on transformed and untransformed datasets

In [71]:
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 2.39 s, sys: 2 s, total: 4.39 s
Wall time: 5.84 s
Out[71]:
param_epsilon params mean_test_score std_test_score rank_test_score
3 1.5 {'epsilon': 1.5} 0.712086 0.016685 1
7 10 {'epsilon': 10} 0.706730 0.010067 2
5 3 {'epsilon': 3} 0.706533 0.012266 3
6 5 {'epsilon': 5} 0.699158 0.017403 4
1 1.2 {'epsilon': 1.2} 0.696373 0.034517 5
In [72]:
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 338 ms, sys: 259 ms, total: 597 ms
Wall time: 333 ms
Out[72]:
param_epsilon params mean_test_score std_test_score rank_test_score
4 2 {'epsilon': 2} -0.065557 0.012171 1
5 3 {'epsilon': 3} -0.065585 0.012266 2
3 1.5 {'epsilon': 1.5} -0.065589 0.012170 3
2 1.35 {'epsilon': 1.35} -0.065600 0.012171 4
1 1.2 {'epsilon': 1.2} -0.065611 0.012185 5

Relative errors on test sets

In [73]:
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
In [74]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass

Interestingly, the HuberRegressor does not work on the untransformed dataset. It seems to underfits the data by completly ignoring the outliers. I have to make a compromise between ElasticNet and HuberRegressor so I will choose to try the Bayensian Rigde.

Saving

In [75]:
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)

params = grid.best_params_
__params = __grid.best_params_

RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 1, 'RMSE':rmse, **params}, ignore_index=True)
RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 0, 'RMSE':__rmse, **__params}, ignore_index=True)
RESULTS
Out[75]:
model is_preprocess RMSE fit_intercept normalize alpha l1_ratio max_iter epsilon
0 Baseline 1 646.328187 NaN NaN NaN NaN NaN NaN
1 Baseline 0 633.377438 NaN NaN NaN NaN NaN NaN
2 LinearRegression 1 294.091808 0.0 1.0 NaN NaN NaN NaN
3 LinearRegression 0 422.606908 1.0 0.0 NaN NaN NaN NaN
4 ElasticNet 1 420.322384 NaN NaN 0.10 0.0 10000.0 NaN
5 ElasticNet 0 422.842630 NaN NaN 0.75 1.0 10000.0 NaN
6 HuberRegressor 1 371.638692 NaN NaN NaN NaN NaN 1.5
7 HuberRegressor 0 654.306176 NaN NaN NaN NaN NaN 2.0

BayesianRidge

Model Name: BayesianRidge
Description: Similar to Ridge but the regularization parameter is tuned to fit the data during the training process.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: no
Adaptive Regularization: yes
Large Dataset: no
Non linear: no
Interpretability Score: 2 / 5
When to Use: Ridge but don't want to set regularization constant
When to Use Expanded:
$\qquad$- Are seeking results similar to Ridge, but willing to sacrifice interpretability for time saved not having to test different regularization weights
Advantages:
$\qquad$- No need to tune alpha value
$\qquad$- Adapts well to data on hand
Disadvantages:
$\qquad$- Less interpretable results
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None

Grid Search:

n_iter: int, default=300
Maximum number of iterations. Should be greater than or equal to 1.

In [76]:
model = BayesianRidge
model_name = type(model()).__name__


param_grid = {
    'n_iter': [5, 20, 50, 100, 200, 300],
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

Fit on transformed and untransformed datasets

In [77]:
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 284 ms, sys: 165 ms, total: 449 ms
Wall time: 1.21 s
Out[77]:
param_n_iter params mean_test_score std_test_score rank_test_score
0 5 {'n_iter': 5} 0.79665 0.010651 1
1 20 {'n_iter': 20} 0.79665 0.010651 1
2 50 {'n_iter': 50} 0.79665 0.010651 1
3 100 {'n_iter': 100} 0.79665 0.010651 1
4 200 {'n_iter': 200} 0.79665 0.010651 1
In [78]:
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 209 ms, sys: 187 ms, total: 395 ms
Wall time: 156 ms
Out[78]:
param_n_iter params mean_test_score std_test_score rank_test_score
1 20 {'n_iter': 20} 0.54392 0.014115 1
2 50 {'n_iter': 50} 0.54392 0.014115 1
3 100 {'n_iter': 100} 0.54392 0.014115 1
4 200 {'n_iter': 200} 0.54392 0.014115 1
5 300 {'n_iter': 300} 0.54392 0.014115 1

Relative errors on test sets

In [79]:
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
In [80]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass

Here we are ! The BayesianRigde show us one thing. It can be hard to be sure of that but let's summarize my thoughs :

  • The BayesianRigde is like Ridge (or ElasticNet in a way) and don't need to tune alpha value for regularization.
  • The error on the best BayesianRigde is the exact same as the best LinearRegression.

So I think there is no regularization needed as the LinearRegression is the best model for now and the other models tried are some variants with Regularization. I am no sure, if I try another variant, it will lead to the same conclusion. Let's go for the ARDRegression.

Saving

In [81]:
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)

params = grid.best_params_
__params = __grid.best_params_

RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 1, 'RMSE':rmse, **params}, ignore_index=True)
RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 0, 'RMSE':__rmse, **__params}, ignore_index=True)
RESULTS
Out[81]:
model is_preprocess RMSE fit_intercept normalize alpha l1_ratio max_iter epsilon n_iter
0 Baseline 1 646.328187 NaN NaN NaN NaN NaN NaN NaN
1 Baseline 0 633.377438 NaN NaN NaN NaN NaN NaN NaN
2 LinearRegression 1 294.091808 0.0 1.0 NaN NaN NaN NaN NaN
3 LinearRegression 0 422.606908 1.0 0.0 NaN NaN NaN NaN NaN
4 ElasticNet 1 420.322384 NaN NaN 0.10 0.0 10000.0 NaN NaN
5 ElasticNet 0 422.842630 NaN NaN 0.75 1.0 10000.0 NaN NaN
6 HuberRegressor 1 371.638692 NaN NaN NaN NaN NaN 1.5 NaN
7 HuberRegressor 0 654.306176 NaN NaN NaN NaN NaN 2.0 NaN
8 BayesianRidge 1 294.599647 NaN NaN NaN NaN NaN NaN 5.0
9 BayesianRidge 0 422.677868 NaN NaN NaN NaN NaN NaN 20.0

ARDRegression

Model Name: ARDRegression
Description: BayesianRidge with sparser weight values. Almost like a version of BayesianLasso.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: yes
Adaptive Regularization: yes
Large Dataset: no
Non linear: no
Interpretability Score: 2 /5
When to Use: Lasso but don't want to set regularization constant
When to Use Expanded:
$\qquad$- Are seeking results similar to Lasso, but willing to sacrifice interpretability for time saved not having to test different regularization term weights
Advantages:
$\qquad$- No need to tune alpha value
$\qquad$- Adapts well to data on hand
$\qquad$- Reduces weight of unimportant features
Disadvantages: '
$\qquad$- Less interpretable results
$\qquad$- Computationally expensive (can't handle very large datasets)
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None

Grid Search:

n_iter: n_iterint, default=300
Maximum number of iterations.

In [82]:
model = ARDRegression
model_name = type(model()).__name__


param_grid = {
    'n_iter': [5, 20, 50, 100, 200, 300],
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

Fit on transformed and untransformed datasets

In [83]:
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 1.41 s, sys: 1.75 s, total: 3.16 s
Wall time: 1.53 s
Out[83]:
param_n_iter params mean_test_score std_test_score rank_test_score
2 50 {'n_iter': 50} 0.796176 0.010012 1
3 100 {'n_iter': 100} 0.796176 0.010012 1
4 200 {'n_iter': 200} 0.796176 0.010012 1
5 300 {'n_iter': 300} 0.796176 0.010012 1
1 20 {'n_iter': 20} 0.796164 0.010020 5
In [84]:
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 301 ms, sys: 341 ms, total: 642 ms
Wall time: 360 ms
Out[84]:
param_n_iter params mean_test_score std_test_score rank_test_score
2 50 {'n_iter': 50} 0.540773 0.019104 1
3 100 {'n_iter': 100} 0.540773 0.019104 1
4 200 {'n_iter': 200} 0.540773 0.019104 1
5 300 {'n_iter': 300} 0.540773 0.019104 1
1 20 {'n_iter': 20} 0.540773 0.019104 5

Relative errors on test sets

In [85]:
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
In [86]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass

And yes, I obtain the same conclusion I had for the BayesianRidge. It is time to stop trying linear models and move on for other regressors. I will try the KNeighborsRegressor.

Saving

In [87]:
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)

params = grid.best_params_
__params = __grid.best_params_

RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 1, 'RMSE':rmse, **params}, ignore_index=True)
RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 0, 'RMSE':__rmse, **__params}, ignore_index=True)
RESULTS
Out[87]:
model is_preprocess RMSE fit_intercept normalize alpha l1_ratio max_iter epsilon n_iter
0 Baseline 1 646.328187 NaN NaN NaN NaN NaN NaN NaN
1 Baseline 0 633.377438 NaN NaN NaN NaN NaN NaN NaN
2 LinearRegression 1 294.091808 0.0 1.0 NaN NaN NaN NaN NaN
3 LinearRegression 0 422.606908 1.0 0.0 NaN NaN NaN NaN NaN
4 ElasticNet 1 420.322384 NaN NaN 0.10 0.0 10000.0 NaN NaN
5 ElasticNet 0 422.842630 NaN NaN 0.75 1.0 10000.0 NaN NaN
6 HuberRegressor 1 371.638692 NaN NaN NaN NaN NaN 1.5 NaN
7 HuberRegressor 0 654.306176 NaN NaN NaN NaN NaN 2.0 NaN
8 BayesianRidge 1 294.599647 NaN NaN NaN NaN NaN NaN 5.0
9 BayesianRidge 0 422.677868 NaN NaN NaN NaN NaN NaN 20.0
10 ARDRegression 1 295.043718 NaN NaN NaN NaN NaN NaN 50.0
11 ARDRegression 0 424.484307 NaN NaN NaN NaN NaN NaN 50.0

KNeighborsRegressor

Model Name: KNeighborsRegressor
Description: Creates a model based off of the k nearest neighbors at any given point. Where k is an input argument.
Prevents Overfitting: no
Handles Outliers: no
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: yes
Interpretability Score: 5 / 5
When to Use: Nonlinear data, interpretability is important, unimportant features
When to Use Expanded:
$\qquad$- When you are unsure of the structure of your data and want a model that will fit well
$\qquad$- Not concerned with overfitting
$\qquad$- Interpretability is important
Advantages:
$\qquad$- Fits very well to data of various structures
$\qquad$- More interpretable than other nonlinear models
Disadvantages:
$\qquad$- Extremely impacted by outliers and corrupt data
$\qquad$- Need several more samples than features for quality results
$\qquad$- Difficulty dealing with large numbers of features
Sklearn Package: neighbors
Required Args: None
Helpful Args: n_neighbors (number of neighbors to use)
Variations: None

Grid Search:

n_neighbors: sint, default=5
Number of neighbors to use by default for kneighbors queries.

weights: str or callable, default=’uniform’
weight function used in prediction. Possible values: ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally. And ‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.

p: int, default=2
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

In [88]:
model = KNeighborsRegressor
model_name = type(model()).__name__


param_grid = {
    'n_neighbors': [3, 5, 10, 15, 20],
    'weights': ['uniform', 'distance'],
    'algorithm': ['ball_tree', 'kd_tree'],
    'p': [0.5, 1, 2]
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

Fit on transformed and untransformed datasets

In [89]:
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 425 ms, sys: 47.4 ms, total: 472 ms
Wall time: 4.71 s
Out[89]:
param_n_neighbors param_weights param_algorithm param_p params mean_test_score std_test_score rank_test_score
39 5 distance kd_tree 1 {'algorithm': 'kd_tree', 'n_neighbors': 5, 'p'... 0.451416 0.012672 1
9 5 distance ball_tree 1 {'algorithm': 'ball_tree', 'n_neighbors': 5, '... 0.451416 0.012672 1
15 10 distance ball_tree 1 {'algorithm': 'ball_tree', 'n_neighbors': 10, ... 0.446125 0.028077 3
45 10 distance kd_tree 1 {'algorithm': 'kd_tree', 'n_neighbors': 10, 'p... 0.446125 0.028077 3
51 15 distance kd_tree 1 {'algorithm': 'kd_tree', 'n_neighbors': 15, 'p... 0.436413 0.028623 5
In [90]:
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 241 ms, sys: 16.6 ms, total: 258 ms
Wall time: 971 ms
Out[90]:
param_n_neighbors param_weights param_algorithm param_p params mean_test_score std_test_score rank_test_score
39 5 distance kd_tree 1 {'algorithm': 'kd_tree', 'n_neighbors': 5, 'p'... 0.644304 0.040958 1
9 5 distance ball_tree 1 {'algorithm': 'ball_tree', 'n_neighbors': 5, '... 0.644304 0.040958 1
33 3 distance kd_tree 1 {'algorithm': 'kd_tree', 'n_neighbors': 3, 'p'... 0.634172 0.039352 3
3 3 distance ball_tree 1 {'algorithm': 'ball_tree', 'n_neighbors': 3, '... 0.634172 0.039352 3
15 10 distance ball_tree 1 {'algorithm': 'ball_tree', 'n_neighbors': 10, ... 0.630314 0.026451 5

Relative errors on test sets

In [91]:
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
In [92]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass

We have really poor results on the transformed dataset. It can be easly explained :

  • It is useful non linear data and unimportant features. Yet some of our features are definetly imporant as with saw on our analyis part.
  • Also, our dataset if the combination of the 3 disadvantages of this mode :
    • Extremely impacted by outliers and corrupt data
    • Need several more samples than features for quality results
    • Difficulty dealing with large numbers of features

One RandomForestRegressor can deal with these disadvantages.

Saving

In [93]:
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)

params = grid.best_params_
__params = __grid.best_params_

RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 1, 'RMSE':rmse, **params}, ignore_index=True)
RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 0, 'RMSE':__rmse, **__params}, ignore_index=True)
RESULTS
Out[93]:
model is_preprocess RMSE fit_intercept normalize alpha l1_ratio max_iter epsilon n_iter algorithm n_neighbors p weights
0 Baseline 1 646.328187 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Baseline 0 633.377438 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 LinearRegression 1 294.091808 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 LinearRegression 0 422.606908 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ElasticNet 1 420.322384 NaN NaN 0.10 0.0 10000.0 NaN NaN NaN NaN NaN NaN
5 ElasticNet 0 422.842630 NaN NaN 0.75 1.0 10000.0 NaN NaN NaN NaN NaN NaN
6 HuberRegressor 1 371.638692 NaN NaN NaN NaN NaN 1.5 NaN NaN NaN NaN NaN
7 HuberRegressor 0 654.306176 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN
8 BayesianRidge 1 294.599647 NaN NaN NaN NaN NaN NaN 5.0 NaN NaN NaN NaN
9 BayesianRidge 0 422.677868 NaN NaN NaN NaN NaN NaN 20.0 NaN NaN NaN NaN
10 ARDRegression 1 295.043718 NaN NaN NaN NaN NaN NaN 50.0 NaN NaN NaN NaN
11 ARDRegression 0 424.484307 NaN NaN NaN NaN NaN NaN 50.0 NaN NaN NaN NaN
12 KNeighborsRegressor 1 481.712121 NaN NaN NaN NaN NaN NaN NaN ball_tree 5.0 1.0 distance
13 KNeighborsRegressor 0 380.199394 NaN NaN NaN NaN NaN NaN NaN ball_tree 5.0 1.0 distance

RandomForestRegressor

Model Name: RandomForestRegressor
Description: A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Prevents Overfitting: yes
Handles Outliers: yes
Handles several features: yes
Adaptive Regularization: no
Large Dataset: yes
Non linear: yes
Interpretability Score: 3 / 5
When to Use: Nonlinear data groups in buckets
When to Use Expanded:
$\qquad$- Data is not linear and is composed more of "buckets"
$\qquad$- Number of samples > number of features
$\qquad$- There are dependent features in the input data. DTR handles these correlations well.
Advantages:
$\qquad$- Can export tree structure to see which features the tree is splitting on
$\qquad$- Handles sparse and correlated data well
$\qquad$- Able to tune the model to help with overfitting problem
Disadvantages:
$\qquad$- Prediction accuracy on complex problems is usually inferior to gradient-boosted trees.
$\qquad$- A forest is less interpretable than a single decision tree.
Sklearn Package: tree
Required Args: None
Helpful Args: criterion and max_depth
Variations: gradient-boosted trees

Grid Search:

n_estimators: int, default=100
The number of trees in the forest.

criterion: str, default=”mse”
The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.

max_depth: int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

max_features: str, int or float, default=”auto”
The number of features to consider when looking for the best split.

In [94]:
model = RandomForestRegressor
model_name = type(model()).__name__

param_grid = {
    'max_depth': [None, 10, 50, 100],
    'max_features' : ['auto', 'sqrt', 'log2'], 
    'n_estimators': [10, 50, 100],
    'criterion': ['mse', 'mae']
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

Fit on transformed and untransformed datasets

In [95]:
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 6.51 s, sys: 125 ms, total: 6.63 s
Wall time: 25min 50s
Out[95]:
param_max_depth param_max_features param_n_estimators param_criterion params mean_test_score std_test_score rank_test_score
20 50 auto 100 mse {'criterion': 'mse', 'max_depth': 50, 'max_fea... 0.907613 0.004603 1
29 100 auto 100 mse {'criterion': 'mse', 'max_depth': 100, 'max_fe... 0.907579 0.004570 2
2 None auto 100 mse {'criterion': 'mse', 'max_depth': None, 'max_f... 0.907579 0.004570 2
19 50 auto 50 mse {'criterion': 'mse', 'max_depth': 50, 'max_fea... 0.906034 0.005074 4
28 100 auto 50 mse {'criterion': 'mse', 'max_depth': 100, 'max_fe... 0.905993 0.005013 5
In [96]:
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
CPU times: user 3.48 s, sys: 96 ms, total: 3.57 s
Wall time: 6min 46s
Out[96]:
param_max_depth param_max_features param_n_estimators param_criterion params mean_test_score std_test_score rank_test_score
2 None auto 100 mse {'criterion': 'mse', 'max_depth': None, 'max_f... 0.878744 0.010214 1
20 50 auto 100 mse {'criterion': 'mse', 'max_depth': 50, 'max_fea... 0.878744 0.010214 1
29 100 auto 100 mse {'criterion': 'mse', 'max_depth': 100, 'max_fe... 0.878744 0.010214 1
1 None auto 50 mse {'criterion': 'mse', 'max_depth': None, 'max_f... 0.877307 0.009597 4
19 50 auto 50 mse {'criterion': 'mse', 'max_depth': 50, 'max_fea... 0.877307 0.009597 4

Relative errors on test sets

In [97]:
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
In [98]:
fig, axes = plt.subplots(1, 2)

ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)

ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass

The RandomForestRegressor is the best model I have tried. It reach the best results fot both datasets, transformed and untransformed data and is a few better on the transformed dataset. I will stop the models enumeration and will compare the final results.

Saving

In [99]:
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)

params = grid.best_params_
__params = __grid.best_params_

RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 1, 'RMSE':rmse, **params}, ignore_index=True)
RESULTS = RESULTS.append({'model' : model_name, 'is_preprocess': 0, 'RMSE':__rmse, **__params}, ignore_index=True)
RESULTS
Out[99]:
model is_preprocess RMSE fit_intercept normalize alpha l1_ratio max_iter epsilon n_iter algorithm n_neighbors p weights criterion max_depth max_features n_estimators
0 Baseline 1 646.328187 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Baseline 0 633.377438 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 LinearRegression 1 294.091808 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 LinearRegression 0 422.606908 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ElasticNet 1 420.322384 NaN NaN 0.10 0.0 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 ElasticNet 0 422.842630 NaN NaN 0.75 1.0 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 HuberRegressor 1 371.638692 NaN NaN NaN NaN NaN 1.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 HuberRegressor 0 654.306176 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 BayesianRidge 1 294.599647 NaN NaN NaN NaN NaN NaN 5.0 NaN NaN NaN NaN NaN NaN NaN NaN
9 BayesianRidge 0 422.677868 NaN NaN NaN NaN NaN NaN 20.0 NaN NaN NaN NaN NaN NaN NaN NaN
10 ARDRegression 1 295.043718 NaN NaN NaN NaN NaN NaN 50.0 NaN NaN NaN NaN NaN NaN NaN NaN
11 ARDRegression 0 424.484307 NaN NaN NaN NaN NaN NaN 50.0 NaN NaN NaN NaN NaN NaN NaN NaN
12 KNeighborsRegressor 1 481.712121 NaN NaN NaN NaN NaN NaN NaN ball_tree 5.0 1.0 distance NaN NaN NaN NaN
13 KNeighborsRegressor 0 380.199394 NaN NaN NaN NaN NaN NaN NaN ball_tree 5.0 1.0 distance NaN NaN NaN NaN
14 RandomForestRegressor 1 185.560972 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN mse 50 auto 100.0
15 RandomForestRegressor 0 218.663397 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN mse None auto 100.0

Saving best results

In [100]:
RESULTS.to_csv('model_results.csv', index=False)

Model comparison

In [101]:
results = pd.read_csv('model_results_final.csv')
results
Out[101]:
model is_preprocess RMSE fit_intercept normalize alpha l1_ratio max_iter epsilon n_iter algorithm n_neighbors p weights criterion max_depth max_features n_estimators
0 Baseline 1 646.328187 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Baseline 0 633.377438 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 LinearRegression 1 294.091740 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 LinearRegression 0 422.606908 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ElasticNet 1 420.324355 NaN NaN 0.10 0.0 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 ElasticNet 0 422.842630 NaN NaN 0.75 1.0 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 HuberRegressor 1 355.849712 NaN NaN NaN NaN NaN 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 HuberRegressor 0 654.306176 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 BayesianRidge 1 294.599579 NaN NaN NaN NaN NaN NaN 5.0 NaN NaN NaN NaN NaN NaN NaN NaN
9 BayesianRidge 0 422.677868 NaN NaN NaN NaN NaN NaN 20.0 NaN NaN NaN NaN NaN NaN NaN NaN
10 ARDRegression 1 295.043722 NaN NaN NaN NaN NaN NaN 50.0 NaN NaN NaN NaN NaN NaN NaN NaN
11 ARDRegression 0 424.484307 NaN NaN NaN NaN NaN NaN 50.0 NaN NaN NaN NaN NaN NaN NaN NaN
12 KNeighborsRegressor 1 481.546507 NaN NaN NaN NaN NaN NaN NaN ball_tree 5.0 1.0 distance NaN NaN NaN NaN
13 KNeighborsRegressor 0 380.199394 NaN NaN NaN NaN NaN NaN NaN ball_tree 5.0 1.0 distance NaN NaN NaN NaN
14 RandomForestRegressor 1 186.311654 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN mse 50.0 auto 100.0
15 RandomForestRegressor 0 218.663397 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN mse NaN auto 100.0

As I ran the same cells to have an homogeneous way to compare the models, the model comparison will be really easy. I only group the results but dataset and sort the RMSE values ascendingly according to the transformed dataset.

In [102]:
results_pivot = results.pivot("model", "is_preprocess", "RMSE")
results_pivot = results_pivot.reindex(results_pivot.sort_values(by=1, ascending=True).index)

results_pivot.plot(kind='bar', color=['#2f9599', '#80f'], alpha=0.5, edgecolor='#fff')
plt.title('Model Comparison')
plt.xlabel('Models')
plt.ylabel('RMSE on test sets')
results_pivot
Out[102]:
is_preprocess 0 1
model
RandomForestRegressor 218.663397 186.311654
LinearRegression 422.606908 294.091740
BayesianRidge 422.677868 294.599579
ARDRegression 424.484307 295.043722
HuberRegressor 654.306176 355.849712
ElasticNet 422.842630 420.324355
KNeighborsRegressor 380.199394 481.546507
Baseline 633.377438 646.328187

Here we have the same conclusions, the KNeighborsRegressor is a poor model to predict the number of rented bikes, then the Linear models are better preferably without any regularization. And the RandomForestRegressor is clearly the best model to choose for the two dataset, even it works better on the transformed one.

It can be interesting to see what are the most important features on our case to predict the number of bike rented for a given hour. Of course, as we have 89 feautures I will limit the bar plot to the top 30 features considering the other ones are not important to our result.

In [103]:
rf_model = grid.best_estimator_
feature_importances = {
    feature: importance
    for feature, importance in zip(X_train.columns, rf_model.feature_importances_)
}
feature_importances = {k: v for k, v in sorted(feature_importances.items(), key=lambda item: item[1], reverse=True)[:30]}

plt.bar(range(len(feature_importances)), list(feature_importances.values()), align='center', color='#8800ff', alpha=0.5)
plt.xticks(range(len(feature_importances)), list(feature_importances.keys()), rotation=90)
plt.title('Feature Importance (top 30 only)')
plt.xlabel('Features')
plt.ylabel('Importance')
pass

Not surprisingly, the meteorological feautures are globally the most important ones. The the working hours are following with 18h for instance.

I we can conclude from this plot :

- There are more bikes rented when the weather is good and the temperature is high
- There are more bikes rented during peak hours

Deploying model

I have chosen my model so I need to create the methods for the model to be deployed.

Save chosen model

First I save my RandomForestRegressor model

In [104]:
pkl_filename = "rf_model.pkl"
with open(pkl_filename, 'wb') as f:
    pickle.dump(rf_model, f)

Preprocessing functions client side

On the client side, it will be ask to create a matrix of features to predict. Each row is a list of ordered features as following :

Feature Format
Date dd/mm/yyyy
Hour int
Temperature °C
Humidity %
Wind speed m/s
Visibility 10m
Dew point temperature °C
Solar Radiation MJ/m2
Rainfal mm
Snowfall cm
Seasons {"Winter", "Autumn", "Spring", "Summer"}
Holiday {"Holiday", "No Holiday"} Feature Format

I recreate the preprocessing function :

  • set holiday feautre as boolean
  • extract time features
  • trasnform meteorological arguments
  • complete preprocessing with one hot encoding on categorical feautures and norm/standardization on numerical ones
In [105]:
def dummies_from_file(df, file_name):
    _df = df.copy()
    with open(file_name) as f:
         dummies = json.load(f)


    for column, values in dummies.items():
        for value in values:
            _df[f'{column}_{value}'] = (_df[column].astype(str) == value).astype(int)
        _df = _df.drop(column, axis=1)
    return _df
    

def norm_from_file(df, file_name):
    _df = df.copy()
    with open(file_name) as f:
         transformations = json.load(f)


    for column, t in transformations.items():
        if column == 'n_bike':
            continue
        elif t['function'] == 'standardization':
            _df[column] = (_df[column] - t['mean']) / t['std']
        elif t['function'] == 'normalization':
            _df[column] = _df[column] / t['maxi']

    return _df

def preprocess(values, return_df=False):
    _df = pd.DataFrame(values, columns = [
     'date',    # Date (dd/mm/yyyy)
     'hour',    # Hour (int)
     'temp',    # Temperature (°C)
     'hum',     # Humidity (%)
     'wind',    # Wind speed (m/s)
     'visb',    # Visibility (10m)
     'dew',     # Dew point temperature (°C)
     'solar',   # Solar Radiation (MJ/m2)
     'rain',    # Rainfall(mm)
     'snow',    # Snowfall (cm)
     'season',  # Seasons ({"Winter", "Autumn", "Spring", "Summer"})
     'holiday', # Holiday ({"Holiday", "No Holiday"})
    ])


    # set boolean string to boolean
    _df['holiday'] = (_df['holiday'] == "Holiday").astype(int)

    # get datetime and its arguments
    _df['date'] = pd.to_datetime(_df['date'], format="%d/%m/%Y")
    _df['year'] = _df['date'].dt.year
    _df['month'] = _df['date'].dt.month_name()
    _df['day'] = _df['date'].dt.day
    _df['week_day'] = _df['date'].dt.day_name()
    _df['working_day'] = (_df['date'].dt.dayofweek < 5).astype(np.int)
    _df = _df.drop('date', axis=1)
    # meteorological arguments
    _df['wind'] = np.log(_df['wind'].astype(float) + 1)
    _df['solar'] = np.sqrt(_df['solar'].astype(float))
    _df['rain'] = _df.rolling(2, min_periods=1)['rain'].mean()
    _df['dryness'] = 1 / (_df.rain + 1)
    _df = _df.drop('rain', axis=1)
    _df['snow'] = _df.rolling(8, min_periods=1)['snow'].mean()
    _df['snowing'] = (_df.snow > 0).astype(np.int)
    _df = _df.drop('snow', axis=1)
    _df['invisb'] = _df.visb.max() - _df.visb
    _df = _df.drop('visb', axis=1)

    # complete preprocessing
    _df = dummies_from_file(_df, 'dummies.json')
    _df = norm_from_file(_df, 'transformations.json')
    if not return_df:
        _df = _df.values
    return _df

Sanity check

To test if my functions are well implemented, I get the raw values from the csv file and I perform the preprocessing on this matrix.

In [106]:
df = pd.read_csv("SeoulBikeData.csv")
df = df.drop(['Rented Bike Count', 'Functioning Day'], axis=1)
values = df.values
values
Out[106]:
array([['01/12/2017', 0, -5.2, ..., 0.0, 'Winter', 'No Holiday'],
       ['01/12/2017', 1, -5.5, ..., 0.0, 'Winter', 'No Holiday'],
       ['01/12/2017', 2, -6.0, ..., 0.0, 'Winter', 'No Holiday'],
       ...,
       ['30/11/2018', 21, 2.6, ..., 0.0, 'Autumn', 'No Holiday'],
       ['30/11/2018', 22, 2.1, ..., 0.0, 'Autumn', 'No Holiday'],
       ['30/11/2018', 23, 1.9, ..., 0.0, 'Autumn', 'No Holiday']],
      dtype=object)
In [107]:
df = preprocess(values,  return_df=True)
values = preprocess(values,  return_df=False)

Then I compare the X I had on the part 5.4 Train-test split with the matrix of values from raw for :

  • the same columns order
  • the same values for each sample
In [108]:
if sum([X_col != df_col for X_col, df_col in zip(X.columns, df.columns)]):
    print('Columns are not in the same order')
else:
    print('Columns are in the same order')


err = X.copy
for col in X.columns:
    a, b = X[col], df[col]
    err = (a - b).mean()
    if err != 0:
        print(f'Error for column {col}:{err}')
Columns are in the same order
Error for column dryness:-6.300452845048238e-05

Columns are in the same order and are the same. All the features are also the same except for the dryness but it can be cause by an approximation computation. Everything is clean.

Server side

Then remains the server side where the model is deployed. From the client side, it will be asked to resquest the API (host on local host for now) from the '/predict' route with the json matrix of features to predict.

Here the python code from the deployed_model.py file:

# Import libraries
import json
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
# Load the model
model = pickle.load(open('rf_model.pkl','rb'))
with open('transformations.json') as f:
     transformations = json.load(f)
     mean, std = transformations['n_bike']['mean'], transformations['n_bike']['std']


@app.route('/predict',methods=['POST'])
def predict():
    # Get the data from the POST request.
    data = request.get_json(force=True)
    # Make prediction using model loaded from disk as per the data.
    inputs = np.array(data['inputs'])

    prediction = (model.predict(inputs) * std + mean) ** 2

    return jsonify(list(prediction))

if __name__ == '__main__':
    app.run(port=5000, debug=True)

Sanity check

We can run this server from terminal running the python deployed_model.py or python3 deployed_model.py according to your OS.

I can first check I can run the request without any issue as a sanity check

In [112]:
import requests

url = 'http://localhost:5000/predict'

def serialize(df):
    return [[value for value in row] for row in df.values]
     

r = requests.post(url, json={'inputs': serialize(df)})
In [ ]:
The model is runi
In [113]:
pred = np.array(r.json())

pred_true = grid.best_estimator_.predict(df.values)

err = pred - target_unfit(pred_true)
err.mean()
Out[113]:
0.0

Exporting notebook

In [114]:
%%capture

!jupyter nbconvert Axel_THEVENOT_Python_For_Data_Analysis.ipynb --to HTML --template classic
!jupyter nbconvert Axel_THEVENOT_Python_For_Data_Analysis.ipynb --to HTML --template classic --no-input --output "Axel_THEVENOT_Python_For_Data_Analysis_report.html"